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A parallelizable SSOR preconditioning scheme for Krylov subspace iterative solvers in lattice QCD applica- 
tions involving Wilson ferrnions is presented. In actual Hybrid Monte Carlo simulations and quark propagator 
calculations it helps to reduce the number of iterations by a factor of 2 compared to conventional odd-even 
preconditioning. This corresponds to a gain in cpu-time of 30% - 70% over odd-even preconditioning. 



1. INTRODUCTION 

Efficient numerical algorithms to solve huge 
sparse systems of linear equations are needed to 
reduce the enormous computer power required in 
lattice QCD computations. In the simulation of 
full QCD, as well as in the calculation of Greens 
functions or propagators to determine the proper- 
ties of hadrons, as e.g. the spectrum, weak decay 
constants or weak matrix elements, the computa- 
tional bottleneck is to determine the solution of 
the discretized Dirac equation: 



Mx = 



(1) 



While iterative solving methods almost come to 
a limit when it comes to reducing the number 
of iterations, preconditioning techniques become 
important to further accelerate the inversion. 

In this contribution, we wish to advocate the 
use of general parallel SSOR preconditioning 
techniques in lattice QCD. Our approach may be 
regarded as a generalization of the well known 
odd-even (two variety) ordering to a more flexi- 
ble (many variety) layout or, alternatively, as a 
localization of the globally lexicographic ordering. 

2. PRECONDITIONING 

To precondition eq. ^, we take two non-singular 
matrices V\ and Vi which act as a left and a right 



preconditioner, i.e. we consider the new system 



V^MV^x 



cp = V{~ V, x = V 2 x. 



(2) 



We could now apply efficient solvers like 
BiCGstab replacing each occurrence of M and <f> 
by V-[~ MV% an d 0j respectively. 

The purpose of preconditioning is to reduce the 
number of iterations and the computing time nec- 
essary to achieve a given accuracy. This means 
that (a) V = V1V2 has to be a sufficiently good 
approximation to the inverse of M and (b) find- 
ing solutions for V\ and V2 should be sufficiently 
cheap. 

Consider the decomposition of M into its diag- 
onal, strictly lower L and strictly upper U trian- 
gular parts 

M = 1 - L-U . 

The SSOR preconditioner is given by 

V X =I-L, V 2 =I-U. (3) 

For the SSOR preconditioner we have V\ + V2 — 
M = I. This relation can be exploited through 
the 'Eisenstat-trick' 0: with V^MVf 1 = 
V^" 1 + VT (I — V^ 1 ), the matrix vector product 
w — V{~ MV<2 r can economically be computed 
in the form 



v = KrV 



u = V 1 1 (r — v), 
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Note that multiplications with M are com- 
pletely avoided in this formulation, the only ma- 
trix operations being multiplied with are I — L 
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and I — U. Since these matrices are triangular, 
the solutions can be computed directly via for- 
ward or backward substitution. 

The preconditioned residuals fj are related to 
the unpreconditioned residuals [E| . Upon success- 
ful stopping, one can compute and restart if the 
solution is not yet accurate enough. 

3. ORDERINGS 

In eq. ^ with the Wilson fermion matrix M we 
have the freedom to choose any ordering scheme 
for the lattice points x. Different orderings yield 
different matrices M, which are permutationally 
similar to each other. Consider an arbitrary 




Figure 1. Random, natural and oe orderings of 
the lattice sites defining three matrix multiplica- 
tions which are permutationally equivalent. 

ordering of the lattice points. For a given grid 
point x, the corresponding row in the matrix L 
or U contains exactly the coupling coefficients 
of those nearest neighbors of x which have been 
numbered before or after x, respectively. There- 
fore, a generic formulation of the forward or back- 
ward solution for this ordering is given by the 
rules 

• touch every site 

• forward solve: respect site numbered before 

• backward solve: respect site numbered after 

In this context, the odd-even ordering, so far 
generally considered as the only successful pre- 
conditioner in a parallel computing environment, 
is seen to be a specific example of SSOR pre- 
conditioning (in odd-even ordering all odd lat- 
tice points are numbered before the even ones). 



In traditional QCD computations, the odd-even 
preconditioning is not implemented by using the 
above formulation of the forward (and backward) 
solvers, as for this particular ordering the inverses 
of I — L and I — U can be determined directly. 

Defining M with the natural ( global lexico- 
graphic [gl]) ordering (fig.l) leads to a further im- 
provement over odd-even preconditioning as far 
as the number of iterations is conccr ned § . How- 
ever, its parallel implementation turned out to be 
impractical 

4. PARALLELISATION 

Unlike the lexicographical ordering, the order- 
ing we propose now is adapted to the parallel 
computer used to solve eq. |l| We assume that 
the processors of the parallel computer are con- 
nected as a pi x p2 x j>3 x p^ 4-dimensional grid. 
The space-time lattice can be matched to the pro- 
cessor grid in a natural manner, producing a lo- 
cal lattice of size n l °° x 7"4° c x n l ^ c x n l £ c with 
n\ oc = rii/pi on each processor. 

Let us partition the whole lattice into n loc = 
n ioc n ioc n ]oc n ioc g r0U p S> Each group corresponds 

to a fixed position of the local grid and contains 
all grid points appearing at this position within 
their respective local grid. 

We now consider a natural ordering on each 
local grid, which allows a coherent update of 
corresponding sites on each processor in paral- 
lel. Inter-node communication has to be done 
when the local grid border is touched. Fig.^ 
shows the speedups of various local lattices sizes 
(16 = 2 4 , 64 = 2 2 • 4 2 , 128 = 2 • 4 3 , 256 = 2 ■ 8 • 4 2 ). 
It should be noted that the number of floating 
point operations per iteration is the same in each 
case. 

5. APPLICATION 

Our numerical tests of the locally lexico- 
graphic SSOR preconditioner were performed on 
APEIOO/Quadrics machines (Q4,QH4), a SIMD 
parallel architecture optimized for fast floating 
point arithmetic on block data-structures like 
3x3 SU(3) matrices. We applied the 11 precondi- 
tioner to quark propagator calculations and chi- 
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Figure 2. Local lexicographic ordering with dif- 
ferent local volumes using BiCGStab. Reference 
point (O) refers to o-e preconditioning Q. 



Figure 3. CPU times needed for the quark prop- 
agator calculation on a 8 3 x 16 lattice at /3 = 5.6 
on the Q4 with local lattice size 256 = (4, 4, 8, 2). 



ral Hybrid Monte Carlo simulations with Wilson 
fermions on large lattices. The CPU time gains 
for the computation of propagators are shown in 
fig. 3. In our current HMC implementation on a 
QH4 with a 24 3 x 40 lattice close to the chiral 
regime we found a speedup in convergence rate 
by a factor 2, corresponding to a gain in CPU 
time of 70% in our implementation. 

6. CONCLUSION 

We have presented a new local grid point or- 
dering scheme that allows to carry out efficient 
preconditioning of Krylov subspace solvers. The 
number of iterations as well as the required float- 
ing point operations are reduced by a factor of 
2 for local lattice sizes (> 128 ). This corre- 
sponds to an implementation and machine depen- 
dent gain in CPU time of 30% - 70%. 
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